{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "95f0a171",
   "metadata": {},
   "source": [
    "(webscraping-and-apis)=\n",
    "# Webscraping and APIs\n",
    "\n",
    "## Introduction\n",
    "\n",
    "This chapter will show you how to work with online data that is either obtained from webpages via webscraping or more directly over the internet via an API. An important principle is always to use an API if one is available as this is designed to pass information directly into your Python session and will save you a lot of effort."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51a55374",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# remove cell\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib_inline.backend_inline\n",
    "\n",
    "# Plot settings\n",
    "plt.style.use(\"https://github.com/aeturrell/python4DS/raw/main/plot_style.txt\")\n",
    "matplotlib_inline.backend_inline.set_matplotlib_formats(\"svg\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2b3d11c",
   "metadata": {},
   "source": [
    "### Ethical and legal considerations\n",
    "\n",
    "As we've already said, you should use an API whenever you can. But, for the cases when you can't, what are the rules of the game when it comes to webscraping? First, we need to talk about whether it’s legal and ethical for you to do so. Overall, the situation is complicated with regards to both of these.\n",
    "\n",
    "Legalities depend a lot on where you live. However, as a general principle, if the data is public, non-personal, non-commercial, and factual, you’re likely to be ok. These three factors are important because they’re connected to the site’s terms and conditions, personally identifiable information, and copyright, as we’ll discuss below.\n",
    "\n",
    "If the data isn’t public, non-personal, or factual or you’re scraping the data specifically to make money with it, or you're scraping something that is only available commercially, you’ll need to talk to a lawyer. In any case, you should be respectful of the resources of the server hosting the pages you are scraping. Most importantly, this means that if you’re scraping many pages, you should make sure to wait a little between each request. [Tenacity](https://tenacity.readthedocs.io/en/latest/) is an excellent package for this—it has a function that pauses between attempts to do something (as well as a bunch of other useful features).\n",
    "\n",
    "If you look closely, you’ll find many websites include a “terms and conditions” or “terms of service” link somewhere on the page, and if you read that page closely you’ll often discover that the site specifically prohibits web scraping. These pages tend to be a legal land grab where companies make very broad claims. It’s polite to respect these terms of service where possible, but take any claims with a grain of salt.\n",
    "\n",
    "US courts have generally found that simply putting the terms of service in the footer of the website isn’t sufficient for you to be bound by them, e.g., [HiQ Labs v. LinkedIn](https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn). Generally, to be bound to the terms of service, you must have taken some explicit action like creating an account or checking a box. This is why whether or not the data is **public** is important; if you don’t need an account to access them, it is unlikely that you are bound to the terms of service. Note, however, the situation is rather different in Europe where courts have found that terms of service are enforceable even if you don’t explicitly agree to them.\n",
    "\n",
    "Even if the data is public, you should be extremely careful about scraping personally identifiable information like names, email addresses, phone numbers, dates of birth, etc. Europe has strict laws about the collection or storage of such data (known as GDPR), and regardless of where you live you’re likely to be entering an ethical quagmire. For example, in 2016, a group of researchers scraped public profile information (e.g., usernames, age, gender, location, etc.) about 70,000 people on the dating site OkCupid and they publicly released these data without any attempts at anonymisation. While the researchers felt that there was nothing wrong with this since the data were already public, this work was widely condemned due to ethics concerns around identifiability of users whose information was released in the dataset. If your work involves scraping personally identifiable information, we strongly recommend reading about the OkCupid study3 as well as similar studies with questionable research ethics involving the acquisition and release of personally identifiable information.\n",
    "\n",
    "Finally, you also need to worry about copyright law. Copyright law is complicated. US law describes exactly what’s protected: “[…] original works of authorship fixed in any tangible medium of expression, […]”. It then goes on to describe specific categories that it applies like literary works, musical works, motion pictures and more. Notably absent from copyright protection are data. This means that as long as you limit your scraping to facts, copyright protection does not apply. (But note that Europe has a separate “sui generis” right that protects databases.)\n",
    "\n",
    "As a brief example, in the US, lists of ingredients and instructions are not copyrightable, so copyright can not be used to protect a recipe. But if that list of recipes is accompanied by substantial novel literary content, that is copyrightable. This is why when you’re looking for a recipe on the internet there’s always so much content beforehand.\n",
    "\n",
    "If you do need to scrape original content (like text or images), you may still be protected under the doctrine of fair use. Fair use is not a hard and fast rule, but weighs up a number of factors. It’s more likely to apply if you are collecting the data for research or non-commercial purposes and if you limit what you scrape to just what you need."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "17575f3a",
   "metadata": {},
   "source": [
    "### Prerequisites\n",
    "\n",
    "You will need to install the **pandas** package for this chapter. We'll use **seaborn** too, which you should already have installed. You will also need to install the **beautifulsoup** and **pandas-datareader** packages in your terminal using `pip install beautifulsoup4` and `pip install pandas-datareader` respectively. We'll also use two built-in packages, **textwrap** and **requests**.\n",
    "\n",
    "To kick off, let's import some of the packages we need (it's always good practice to import the packages you need at the top of a script or notebook)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7f62293",
   "metadata": {},
   "outputs": [],
   "source": [
    "import textwrap\n",
    "\n",
    "import pandas as pd\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "from lets_plot import *\n",
    "\n",
    "LetsPlot.setup_html()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f43a5237",
   "metadata": {},
   "source": [
    "## Extracting Data from Files on the Internet using **pandas**\n",
    "\n",
    "It's easy to read data from the internet once you have the url and file type. Here, for instance, is an example that reads in the 'storms' dataset, which is stored as a CSV file in a URL (we'll only grab the first 10 rows):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06108a4d",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.read_csv(\n",
    "    \"https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv\", nrows=10\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "4223e854",
   "metadata": {},
   "source": [
    "## Obtaining data using APIs\n",
    "\n",
    "Using an API (application programming interface) is another way to draw down information from the interweb. Their just a way for one tool, say Python, to speak to another tool, say a server, and usefully exchange information. The classic use case would be to post a request for data that fits a certain query via an API and to get a download of that data back in return. (You should always preferentially use an API over webscraping a site.)\n",
    "\n",
    "Because they are designed to work with any tool, you don't actually need a programming language to interact with an API, it's just a *lot* easier if you do.\n",
    "\n",
    "```{note}\n",
    "An API key is needed in order to access some APIs. Sometimes all you need to do is register with site, in other cases you may have to pay for access.\n",
    "```\n",
    "\n",
    "To see this, let's directly use an API to get some time series data. We will make the call out to the internet using the **requests** package.\n",
    "\n",
    "An API has an 'endpoint', the base url, and then a URL that encodes the question. Let's see an example with the ONS API for which the endpoint is \"https://api.beta.ons.gov.uk/v1/\". The rest of the API has the form 'data?uri=' and then the long ID of both the timeseries (jp9z) and then the dataset (LMS), which is vacancies in the UK services sector.\n",
    "\n",
    "The data that are returned by APIs are typically in JSON format, which looks a lot like a nested Python dictionary and its entries can be accessed in the same way--this is what is happening when getting the series' title in the example below. JSON is not good for analysis, so we'll use **pandas** to put the data into shape."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4226d67",
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://api.beta.ons.gov.uk/v1/data?uri=/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/timeseries/jp9z/lms/previous/v108\"\n",
    "\n",
    "# Get the data from the ONS API:\n",
    "json_data = requests.get(url).json()\n",
    "\n",
    "# Prep the data for a quick plot\n",
    "title = json_data[\"description\"][\"title\"]\n",
    "df = (\n",
    "    pd.DataFrame(pd.json_normalize(json_data[\"months\"]))\n",
    "    .assign(\n",
    "        date=lambda x: pd.to_datetime(x[\"date\"]),\n",
    "        value=lambda x: pd.to_numeric(x[\"value\"]),\n",
    "    )\n",
    "    .set_index(\"date\")\n",
    ")\n",
    "\n",
    "df[\"value\"].plot(title=title, ylim=(0, df[\"value\"].max() * 1.2), lw=3.0);"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "670ce0bb",
   "metadata": {},
   "source": [
    "We've talked about *reading* APIs. You can also create your own to serve up data, models, whatever you like! This is an advanced topic and we won't cover it; but if you do need to, the simplest way is to use [Fast API](https://fastapi.tiangolo.com/). You can find some short video tutorials for Fast API [here](https://calmcode.io/fastapi/hello-world.html).\n",
    "\n",
    "### Pandas Datareader: an easier way to interact with (some) APIs\n",
    "\n",
    "Although it didn't take much code to get the ONS data, it would be even better if it was just a single line, wouldn't it? Fortunately there are some packages out there that make this easy, but it does depend on the API (and APIs come and go over time).\n",
    "\n",
    "By far the most comprehensive library for accessing extra APIs is [**pandas-datareader**](https://pandas-datareader.readthedocs.io/en/latest/), which provides convenient access to:\n",
    "\n",
    "- FRED\n",
    "- Quandl\n",
    "- World Bank\n",
    "- OECD\n",
    "- Eurostat\n",
    "\n",
    "and more.\n",
    "\n",
    "Let's see an example using FRED (the Federal Reserve Bank of St. Louis' economic data library). This time, let's look at the UK unemployment rate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf758fb4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas_datareader.data as web\n",
    "\n",
    "df_u = web.DataReader(\"LRHUTTTTGBM156S\", \"fred\")\n",
    "\n",
    "df_u.plot(title=\"UK unemployment (percent)\", legend=False, ylim=(2, 6), lw=3.0);"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0613aefb",
   "metadata": {},
   "source": [
    "And, because it's also a really useful one, let's also see how to use **pandas-datareader** to access World Bank data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "380ca743",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "# World Bank total greenhouse emissions (CO2e metric tons per capita)\n",
    "# https://data.worldbank.org/indicator/EN.GHG.ALL.PC.CE.AR5\n",
    "# country and region codes at http://api.worldbank.org/v2/country\n",
    "from pandas_datareader import wb\n",
    "\n",
    "df = wb.download(  # download the data from the world bank\n",
    "    indicator=\"EN.GHG.ALL.PC.CE.AR5\",  # indicator code\n",
    "    country=[\"US\", \"CHN\", \"IND\", \"Z4\", \"Z7\"],  # country codes\n",
    "    start=2019,  # start year\n",
    "    end=2019,  # end year\n",
    ")\n",
    "df = df.reset_index()  # remove country as index\n",
    "df[\"country\"] = df[\"country\"].apply(lambda x: textwrap.fill(x, 10))  # wrap long names\n",
    "df = df.sort_values(\"EN.GHG.ALL.PC.CE.AR5\")  # re-order"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00379b93",
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    ggplot(df, aes(x=\"country\", y=\"EN.ATM.CO2E.PC\"))\n",
    "    + geom_bar(aes(fill=\"country\"), color=\"black\", alpha=0.8, stat=\"identity\")\n",
    "    + scale_fill_discrete()\n",
    "    + theme_minimal()\n",
    "    + theme(legend_position=\"none\")\n",
    "    + ggsize(600, 400)\n",
    "    + labs(\n",
    "        subtitle=\"Carbon dioxide (metric tons per capita)\",\n",
    "        title=\"The USA leads the world on per-capita emissions\",\n",
    "        y=\"\",\n",
    "    )\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b7bf16d7",
   "metadata": {},
   "source": [
    "### The OECD API\n",
    "\n",
    "Sometimes it's convenient to use APIs directly, and, as an example, the OECD API comes with a LOT of complexity that direct access can take advantage of. The OECD API makes data available in both JSON and XML formats, and we'll use [**pandasdmx**](https://pandasdmx.readthedocs.io/) (aka the Statistical Data and Metadata eXchange (SDMX) package for the Python data ecosystem) to pull down the XML format data and turn it into a regular **pandas** data frame.\n",
    "\n",
    "Now, key to using the OECD API is knowledge of its many codes: for countries, times, resources, and series. You can find some broad guidance on what codes the API uses [here](https://data.oecd.org/api/sdmx-ml-documentation/) but to find exactly what you need can be a bit tricky. Two tips are:\n",
    "1. If you know what you're looking for is in a particular named dataset, eg \"QNA\" (Quarterly National Accounts), put `https://stats.oecd.org/restsdmx/sdmx.ashx/GetDataStructure/QNA/all?format=SDMX-ML` into your browser and look through the XML file; you can pick out the sub-codes and the countries that are available.\n",
    "2. Browse around on https://stats.oecd.org/ and use Customise then check all the \"Use Codes\" boxes to see whatever your browsing's code names.\n",
    "\n",
    "Let's see an example of this in action. We'd like to see the productivity (GDP per hour) data for a range of countries since 2010. We are going to be in the productivity resource (code \"PDB_LV\") and we want the USD current prices (code \"CPC\") measure of GDP per employed worker (code \"T_GDPEMP) from 2010 onwards (code \"startTime=2010\"). We'll grab this for some developed countries where productivity measurements might be slightly more comparable. The comments below explain what's happening in each step."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "521d4d03",
   "metadata": {},
   "source": [
    "```python\n",
    "import pandasdmx as pdmx\n",
    "# Tell pdmx we want OECD data\n",
    "oecd = pdmx.Request(\"OECD\")\n",
    "# Set out everything about the request in the format specified by the OECD API\n",
    "data = oecd.data(\n",
    "    resource_id=\"PDB_LV\",\n",
    "    key=\"GBR+FRA+CAN+ITA+DEU+JPN+USA.T_GDPEMP.CPC/all?startTime=2010\",\n",
    ").to_pandas()\n",
    "\n",
    "df = pd.DataFrame(data).reset_index()\n",
    "df.head()\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e5cac233",
   "metadata": {},
   "source": [
    "|   | LOCATION |  SUBJECT | MEASURE | TIME_PERIOD |        value |\n",
    "|--:|---------:|---------:|--------:|------------:|-------------:|\n",
    "| 0 |      CAN | T_GDPEMP |     CPC |        2010 | 78848.604088 |\n",
    "| 1 |      CAN | T_GDPEMP |     CPC |        2011 | 81422.364748 |\n",
    "| 2 |      CAN | T_GDPEMP |     CPC |        2012 | 82663.028058 |\n",
    "| 3 |      CAN | T_GDPEMP |     CPC |        2013 | 86368.582158 |\n",
    "| 4 |      CAN | T_GDPEMP |     CPC |        2014 | 89617.632446 |"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "302326b4",
   "metadata": {},
   "source": [
    "Great that worked! We have data in a nice tidy format."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "7849ac74",
   "metadata": {},
   "source": [
    "### Other Useful APIs\n",
    "\n",
    "- There is a regularly updated list of APIs over at this [public APIs repo on github](https://github.com/public-apis/public-apis). It doesn't have an economics section (yet), but it has a LOT of other APIs.\n",
    "- Berkeley Library maintains a [list of economics APIs](https://guides.lib.berkeley.edu/c.php?g=4395&p=7995952) that is well worth looking through.\n",
    "- [NASDAQ Data Link](https://docs.data.nasdaq.com/), which has a great deal of [financial data](https://docs.data.nasdaq.com/docs/data-organization).\n",
    "- [DBnomics](https://db.nomics.world/): publicly-available economic data provided by national and international statistical institutions, but also by researchers and private companies."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cd85df34",
   "metadata": {},
   "source": [
    "## Webscraping\n",
    "\n",
    "Webscraping is a way of grabbing information from the internet that was intended to be displayed in a browser. But it should only be used as a last resort, and only then when permitted by the terms and conditions of a website.\n",
    "\n",
    "If you're getting data from the internet, it's much better to use an API whenever you can: grabbing information in a structure way is *exactly* why APIs exist. APIs should also be more stable than websites, which may change frequently. Typically, if an organisation is happy for you to grab their data, they will have made an API expressly for that purpose. It's pretty rare that there's a major website which *does* permit webscraping but which doesn't have an API; for these websites, if they don't have an API, chances scraping is against their terms and conditions. Those terms and conditions may be enforceable by law (different rules in different countries here, and you really need legal advice if it's not unambiguous as to whether you can scrape or not.)\n",
    "\n",
    "There are other reasons why webscraping is not so good; for example, if you need a back-run then it might be offered through an API but not shown on the webpage. (Or it might not be available at all, in which case it's best to get in touch with the organisation or check out WaybackMachine in case they took snapshots).\n",
    "\n",
    "So this book is pretty down on webscraping as there's almost always a better solution. However, there are times when it is useful.\n",
    "\n",
    "If you do find yourself in a scraping situation, be really sure to check that's legally allowed and also that you are not violating the website's `robots.txt` rules: this is a special file on almost every website that sets out what's fair play to crawl (conditional on legality) and what robots should not go poking around in.\n",
    "\n",
    "In Python, you are spoiled for choice when it comes to webscraping. There are five very strong libraries that cover a real range of user styles and needs: **requests**, **lxml**, **beautifulsoup**, **selenium**, and *scrapy**.\n",
    "\n",
    "For quick and simple webscraping, my usual combo would **requests**, which does little more than go and grab the HTML of a webpage, and **beautifulsoup**, which then helps you to navigate the structure of the page and pull out what you're actually interested in. For dynamic webpages that use javascript rather than just HTML, you'll need **selenium**. To scale up and hit thousands of webpages in an efficient way, you might try **scrapy**, which can work with the other tools and handle multiple sessions, and all other kinds of bells and whistles... it's actually a \"web scraping framework\".\n",
    "\n",
    "It's always helpful to see coding in practice, so that's what we'll do now, but note that we'll be skipping over a lot of important detail such as user agents, being 'polite' with your scraping requests, being efficient with caching and crawling.\n",
    "\n",
    "In lieu of a better example, let's scrape the research page of [http://aeturrell.com/](http://aeturrell.com/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "073d89e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"http://aeturrell.com/research\"\n",
    "page = requests.get(url)\n",
    "page.text[:300]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f8e85230",
   "metadata": {},
   "source": [
    "Okay, what just happened? We asked requests to grab the HTML of the webpage and then printed the first 300 characters of the text that it found.\n",
    "\n",
    "Let's now parse this into something humans can read (or can read more easily) using beautifulsoup:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22f96be0",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup = BeautifulSoup(page.text, \"html.parser\")\n",
    "print(soup.prettify()[60000:60500])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5748e928",
   "metadata": {},
   "source": [
    "Now we see more structure of the page and even some *HTML tags* such as 'title' and 'link'. Now we come to the data extraction part: say we want to pull out every paragraph of text, we can use beautifulsoup to skim down the HTML structure and pull out only those parts with the paragraph tag ('p').\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d82775de",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get all paragraphs\n",
    "all_paras = soup.find_all(\"p\")\n",
    "# Just show one of the paras\n",
    "all_paras[1]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "2936677e",
   "metadata": {},
   "source": [
    "Although this paragraph isn't too bad, you can make this more readable by stripping out HTML tags altogether with the `.text` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11321154",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_paras[1].text"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9d9d890e",
   "metadata": {},
   "source": [
    "Now let's say we didn't care about most of the page, we *only* wanted to get hold of the names of projects. For this we need to identify the tag type of the element we're interested in, in this case 'div', and it's class type, in this case \"project-name\". We do it like this (and show nice text in the process):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "feac1fdd",
   "metadata": {},
   "outputs": [],
   "source": [
    "projects = soup.find_all(\"div\", class_=\"project-content listing-pub-info\")\n",
    "projects = [x.text.strip() for x in projects]\n",
    "projects"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "461271a3",
   "metadata": {},
   "source": [
    "Hooray! We managed to get the information we wanted: all we needed to know was the right tags. A good tip for finding the tags of the info you want is to look at in your browser (eg Google Chrome) and then right-click on the bit you're interested in, then hit 'Inspect'. This will show you the HTML element of the bit of the page you clicked on.\n",
    "\n",
    "That's almost it for this very, very brief introduction to webscraping. We'll just see one more thing: how to iterate over multiple pages.\n",
    "\n",
    "Imagine we had a root webpage such as \"www.codingforeconomists.com\" which had subpages such as \"www.codingforeconomists.com/page=1\", \"www.codingforeconomists.com/page=2\", and so on. One need only iterate create the HTML strings to pass into a function that scrapes each one and return the relevant data, eg for the first 50 pages, and with a function called `scraper()`, one might run\n",
    "\n",
    "```\n",
    "start, stop = 0, 50\n",
    "root_url = \"www.codingforeconomists.com/page=\"\n",
    "info_on_pages = [scraper(root_url + str(i)) for i in range(start, stop)]\n",
    "```\n",
    "\n",
    "That's all we'll cover here but remember we've barely *scraped* the surface of this big, complex topic. If you want to read about an application, it's hard not to recommend the paper on webscraping that has undoubtedly change the world the most, and very likely has affected your own life in numerous ways: [\"The PageRank Citation Ranking: Bringing Order to the Web\"](http://ilpubs.stanford.edu:8090/422/) by Page, Brin, Motwani and Winograd. For a more in-depth example of webscraping, check out realpython's [tutorial](https://realpython.com/python-web-scraping-practical-introduction/)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "52ce838c",
   "metadata": {},
   "source": [
    "### Webscraping Tables\n",
    "\n",
    "Often there are times when you don't actually want to scrape an entire webpage and all you want is the data from a *table* within the page. Fortunately, there is an easy way to scrape individual tables using the **pandas** package.\n",
    "\n",
    "We will read data from the first table on 'https://simple.wikipedia.org/wiki/FIFA_World_Cup' using **pandas**. The function we'll use is `read_html()`, which returns a list of data frames of all the tables it finds when you pass it a URL. If you want to filter the list of tables, use the `match=` keyword argument with text that only appears in the table(s) you're interested in.\n",
    "\n",
    "The example below shows how this works; looking at the website, we can see that the table we're interested in (of past world cup results), has a 'fourth place' column while other tables on the page do not. Therefore we run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ada9ce7",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_list = pd.read_html(\n",
    "    \"https://simple.wikipedia.org/wiki/FIFA_World_Cup\", match=\"Sweden\"\n",
    ")\n",
    "# Retrieve first and only entry from list of data frames\n",
    "df = df_list[0]\n",
    "df.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "31e49317",
   "metadata": {},
   "source": [
    "This gives us the table neatly loaded into a **pandas** data frame ready for further use."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "encoding": "# -*- coding: utf-8 -*-",
   "formats": "md:myst",
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  },
  "toc-showtags": true
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
